A Web-based Advanced and User Friendly System: The Oslo Corpus of Tagged Norwegian Texts
Abstract
A general-purpose text corpus meant for linguists and lexicographers needs to satisfy quality criteria at no fewer than four levels. The first two criteria are fairly well established: the corpus should contain a wide variety of texts and be tagged according to a fine-grained system. The last two are, unfortunately, much less widely appreciated. One concerns the variety of search criteria: the user should be able to search for any information contained in the corpus, in any possible combination. In addition, the search results should be presented in a choice of ways. The fourth criterion concerns accessibility. It is rather surprising that, while user interfaces tend to be simple and self-explanatory in most areas of life represented electronically, corpus interfaces are still extremely user-unfriendly. In this paper, we present a corpus whose interface, and likewise whose search options, we have given a lot of thought: the Oslo Corpus of Tagged Norwegian Texts.
A general-purpose text corpus should satisfy quality criteria at a number of levels in order to fulfill the needs of the majority of its users. We take the major users to be linguists working in academia or, to some extent, in commercial enterprises such as dictionary publishing. Although corpora exist for many languages, and new corpora are being created all the time, it is surprising how little effort is put into making them fully useful as tools. While creating the Oslo Corpus of Tagged Norwegian Texts, we found that the following criteria ought to be fulfilled:
1) Variety of corpus texts. Given that it is impossible to know beforehand what kinds of questions linguists (morphologists, syntacticians, semanticists, lexicographers) will want to ask, and to what extent dictionary writers will use the corpus, it is important to ensure that the corpus has a certain size and comprises a variety of genres and, if there are any, of written standards.
2) Variety of grammatical tagging. It is vital that grammatical tagging goes beyond simple part-of-speech tagging: in many languages there is, even within parts of speech, a lot of homonymy that could be disambiguated with a more fine-grained system of tags. Obviously, the tagging also has to be correct.
3) Variety of search options. Every feature of the corpus, marked or unmarked, should be searchable. E.g., if the corpus consists of several genres, and the …
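The idea behind the search criterion, that any feature of the corpus should be searchable in any combination, can be illustrated with a minimal Python sketch. It is not the Oslo Corpus implementation or its query language: the `Token` record, the tag names, the genre labels, and the `search` function are all hypothetical, chosen only to show how fine-grained tags and text metadata might be combined in one query. The sample data uses the Norwegian noun "hus", whose indefinite singular and plural forms are identical, as an instance of homonymy within one part of speech.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One tagged corpus token (hypothetical record layout)."""
    word: str         # surface form as it appears in the text
    lemma: str        # dictionary form
    pos: str          # coarse part of speech
    feats: frozenset  # fine-grained tags beyond part of speech
    genre: str        # metadata inherited from the source text


def search(tokens, **criteria):
    """Return the tokens that satisfy every criterion given.

    Each keyword names a Token field. For `feats`, the requested tags
    must all be present; every other field must match exactly.
    """
    hits = []
    for tok in tokens:
        matched = True
        for name, wanted in criteria.items():
            value = getattr(tok, name)
            if name == "feats":
                matched = set(wanted) <= value
            else:
                matched = value == wanted
            if not matched:
                break
        if matched:
            hits.append(tok)
    return hits


# Tiny illustrative sample; tags and genres are assumptions, not real corpus data.
CORPUS = [
    Token("hus", "hus", "noun", frozenset({"neuter", "singular", "indefinite"}), "newspaper"),
    Token("hus", "hus", "noun", frozenset({"neuter", "plural", "indefinite"}), "fiction"),
    Token("leser", "lese", "verb", frozenset({"present"}), "fiction"),
]

# Combine a lemma, a fine-grained tag, and a genre in a single query.
plural_hus_in_fiction = search(CORPUS, lemma="hus", feats={"plural"}, genre="fiction")
print(len(plural_hus_in_fiction))
```

Because every field is an optional keyword, a query can use one feature or several, which is the property the criterion asks for; a part-of-speech-only tag set could not separate the two readings of "hus" at all.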
Similar resources
An Advanced Speech Corpus for Norwegian
This paper describes a new Norwegian speech corpus – The NoTa Corpus – that exhibits a variety of useful and advanced features. It contains 900 000 words of transcribed, lemmatised and POS tagged Oslo speech (carefully selected to cover many speech varieties), which is linked directly to audio and video. It has advanced search interfaces both for searches and results presentations. Since corpor...
The ASK Corpus - a Language Learner Corpus of Norwegian as a Second Language
In our paper we present the design and interface of ASK, a language learner corpus of Norwegian as a second language which contains essays collected from language tests on two different proficiency levels as well as personal data from the test takers. In addition, the corpus also contains texts and relevant personal data from native Norwegians as control data. The texts as well as the personal ...
Corpus based coreference resolution for Farsi text
"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...
The Effect of User-Friendly Texts vs. Impersonal and Hybrid Texts on the Reading Comprehension Ability of Iranian EFL Learners
This study focuses on the effect of user-friendly, impersonal, and hybrid texts on the reading comprehension ability of Iranian foreign language learners. Forty-five students of Alzahra University were selected on the basis of their performance in a recent TOEFL. They were given three different texts (each group of 15 students was given one type) describing the same area of English usage, w...
Developing a Recommendation Framework for Tourist by Mining Geo-tag Photos (Case Study Tehran District 6)
With the increasing popularity of sharing media on social networks and easier access to location technologies, such as the Global Positioning System (GPS), people are more interested in sharing their own photos and videos. World Wide Web users are no longer solely consumers but also producers of information; hence a wealth of information is available on Web 2.0 applications. The ...